Technical notes

Office hours: Friday, 5.00 - 6.30 pm

Materials: All the material can be found on the Moodle page

  • Slides at the end of each lecture
  • R code pre-LAB: some lines of code to complete (together) during the lecture
  • R code post-LAB: it will replace the R code pre-LAB after the lecture

First Lab

Today’s goals. Learn about:

  1. Bivariate Normal distribution

  2. Central Limit Theorem (CLT)

  3. Law of Large Numbers (LLN)

Workflow

  1. Consider examples related to theoretical concepts
  2. Using R to solve some basic questions and to understand important aspects

Let’s start

  1. Download the R code (it is briefly commented)
  2. You will find XXX where I will ask you to code a bit
  3. Part of the code after XXX depends on successfully filling in these lines of code

Multivariate Normal Distribution

Recall

  • \({\bf X} = (X_1, \ldots, X_n)^\top\) (an \(n\)-dimensional random vector), \({\bf X} \sim \mathcal{N}_n\,(\boldsymbol \mu, \boldsymbol \Sigma)\)

  • Mean vector: \(\boldsymbol \mu = (\mu_1, \ldots, \mu_n)^\top\), where \(\mu_j = E(X_j)\), \(j=1, \ldots, n\)

  • Covariance matrix: \(\boldsymbol \Sigma\) is an \(n \times n\) matrix whose entry \(\Sigma_{jk}\) is

\[\Sigma_{jj} = \sigma^2_j = \text{var}(X_j), \quad \Sigma_{jk}=\Sigma_{kj} = \sigma_{jk} = \text{cov}(X_j,X_k) \]

  • Density function

\[f_{{\bf X}}({\bf x}) = \frac{1}{(2 \pi)^{n/2}|\boldsymbol \Sigma|^{1/2}} e^{-\frac{1}{2}({\bf x} - \boldsymbol \mu)^\top \boldsymbol \Sigma^{-1} ({\bf x} - \boldsymbol \mu)} \]
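As a quick sketch (not part of the lab code; `dmvn_manual` is an ad-hoc name introduced here), the density above can be coded directly with `solve()` and `det()`:

```r
# Minimal sketch of the general multivariate normal density above
dmvn_manual <- function(x, mu, Sigma) {
  n <- length(mu)
  d <- x - mu
  q <- as.numeric(t(d) %*% solve(Sigma) %*% d)  # quadratic form (x - mu)' Sigma^{-1} (x - mu)
  exp(-0.5 * q) / ((2 * pi)^(n / 2) * sqrt(det(Sigma)))
}

# Sanity check: with a diagonal Sigma the joint density factorizes
# into a product of univariate normal densities
val <- dmvn_manual(c(1, -1), mu = c(0, 0), Sigma = diag(c(4, 9)))
all.equal(val, dnorm(1, 0, 2) * dnorm(-1, 0, 3))
```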

Bivariate Normal Distribution

Multivariate normal with \(n = 2\)

  • \({\bf X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim \mathcal{N}\bigg(\begin{pmatrix}\mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma^2_1 & \sigma_{12} \\ \sigma_{12} & \sigma^2_2 \end{pmatrix}\bigg)\)

  • Determinant of \(\boldsymbol \Sigma\) (\(\rho\) is the correlation coefficient): \[|\boldsymbol \Sigma| = \sigma^2_1 \sigma^2_2 - (\sigma_{12})^2 = \sigma^2_1 \sigma^2_2 \bigg( 1- \frac{(\sigma_{12})^2}{\sigma^2_1 \sigma^2_2}\bigg) = \sigma^2_1 \sigma^2_2 ( 1 - \rho^2)\]

  • Precision matrix: \(\boldsymbol \Sigma^{-1} = \frac{1}{\sigma^2_1 \sigma^2_2 ( 1 - \rho^2)} \begin{pmatrix} \sigma^2_2 & -\sigma_{12} \\ -\sigma_{12} & \sigma^2_1 \end{pmatrix}\)

  • \[f_{{\bf X}}({\bf x}) = \bigg((2 \pi)\sigma_1\sigma_2 \sqrt{1-\rho^2}\bigg)^{-1} e^{-\frac{1}{2(1-\rho^2)}\bigg(\bigg(\frac{x_1 - \mu_1}{\sigma_1}\bigg)^2 + \bigg(\frac{x_2 - \mu_2}{\sigma_2}\bigg)^2 -2\rho\frac{(x_1 - \mu_1)(x_2 - \mu_2)}{\sigma_1\sigma_2} \bigg)}\]
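The determinant and precision-matrix identities above are easy to verify numerically; the values of \(\sigma_1, \sigma_2, \rho\) below are arbitrary, chosen only for illustration:

```r
# Numeric check of |Sigma| and Sigma^{-1} (arbitrary illustrative values)
s1 <- 2; s2 <- 3; rho <- 0.4
s12 <- rho * s1 * s2
Sigma <- matrix(c(s1^2, s12, s12, s2^2), 2, 2)

all.equal(det(Sigma), s1^2 * s2^2 * (1 - rho^2))   # determinant formula

Sinv <- matrix(c(s2^2, -s12, -s12, s1^2), 2, 2) / (s1^2 * s2^2 * (1 - rho^2))
all.equal(solve(Sigma), Sinv)                       # closed-form inverse
```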

A special case: The Bivariate Normal Distribution

Time to work with R

  • Write down a function computing the value of the density of a bivariate Gaussian at \((x_1,x_2)\), with zero mean vector and unit variances, that is

\[f_{(X_1,X_2)}(x_1,x_2) = \frac{1}{(2 \pi)\sqrt{1-\rho^2}} e^{-\frac{1}{2(1-\rho^2)}(x^2_1 + x^2_2 - 2\rho x_1x_2)}\]

NBiv <- function(x, y, rho){
  # Standard bivariate normal density with correlation rho (vectorized in x and y)
  if (!is.numeric(x) || !is.numeric(y)) stop("x and y must be numeric")
  if (!is.numeric(rho) || length(rho) != 1L) stop("rho must be a single number")
  if (abs(rho) >= 1) stop("rho must be a scalar in (-1, 1)")
  
  den <- 2 * pi * sqrt(1 - rho^2)
  num <- exp(-0.5 * (x^2 + y^2 - 2 * rho * x * y) / (1 - rho^2))
  return(num / den)
}

Bivariate Normal Distribution: 3D plots

xx <- yy <- seq(-3, 3, 0.1)
z <- outer(xx, yy, NBiv, rho = 0.5)
par(mfrow = c(1, 2))
persp(xx, yy, z, xlab = "x", ylab = "y", zlab = "f(x,y)", cex.lab = 3)
persp(xx, yy, z, theta = 30, phi = 30, xlab = "x", ylab = "y", zlab = "f(x,y)", cex.lab = 3)

Bivariate Normal Distribution: Contour level

rho <- c(-0.5, 0, 0.9)
z <- list()
for(j in 1:3) z[[j]] <- outer(xx, yy, NBiv, rho = rho[j])
par(mfrow = c(1, 3))
for(j in 1:3) contour(xx, yy, z[[j]], main = paste0("rho = ", rho[j]), 
                      cex.main = 3, cex.axis = 3, labcex = 2)

Bivariate Normal Distribution: 3D Surface plot and Contour levels

Bivariate Normal Distribution: Example

Biometric example

  • Anthropometric data analysis was an important building block of early statistical analysis (in the 19th and early 20th centuries, Quetelet, Galton, Pearson, and others investigated biometric measurements)
  • Here, we consider two measurements, the weight and the height of an adult
  • We are simplifying the phenomenon (both depend on gender, geographical characteristics, and so on)

Assumption

  • Let \(X_1\) and \(X_2\) be the r.v. representing the height and the weight measurements
  • We assume the random vector \((X_1,X_2)\) follows a bivariate normal distribution

Bivariate Normal Distribution: Some properties

\({\bf X} = ({X_1}, {X_2})^\top \sim \mathcal{N}(\boldsymbol \mu, \boldsymbol \Sigma)\)

Get the marginal distribution \(f_{X_j}(x_j) = \int_{-\infty}^{+\infty}f(x_j,x_k)\,dx_k\)

  • So \(X_1 \sim \mathcal{N}(\mu_1, \sigma^2_1)\) and \(X_2 \sim \mathcal{N}(\mu_2, \sigma^2_2)\)

Get the conditional distribution \(f_{X_1|X_2}(x_1|x_2) = f(x_1, x_2)/f(x_2)\)

  • So \(X_1 | X_2 = x_2 \sim \mathcal{N}\big(\mu_1 + \rho \frac{\sigma_1}{\sigma_2}(x_2 - \mu_2), \, \sigma^2_1(1-\rho^2)\big)\)
  • And \(X_2 | X_1 = x_1 \sim \mathcal{N}\big(\mu_2 + \rho \frac{\sigma_2}{\sigma_1}(x_1 - \mu_1), \, \sigma^2_2(1-\rho^2)\big)\)

Random generation

  • Leverage \((\tilde X_1, \tilde X_2)^\top = \boldsymbol \Sigma^{-1/2} ({\bf X} - \boldsymbol \mu) \sim \mathcal{N}({\bf 0}, {\bf I})\)
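In practice one usually goes the other way: with \(L\) the lower Cholesky factor of \(\boldsymbol \Sigma\) and \({\bf Z} \sim \mathcal{N}({\bf 0}, {\bf I})\), the vector \(\boldsymbol \mu + L{\bf Z} \sim \mathcal{N}(\boldsymbol \mu, \boldsymbol \Sigma)\). A base-R sketch (the parameter values are arbitrary, for illustration only):

```r
# Sketch: generate bivariate normal draws via the Cholesky factor
set.seed(1)
mu <- c(176, 85.52)
sd1 <- 7; sd2 <- 14.24; rho <- 0.47
Sigma <- matrix(c(sd1^2, rho * sd1 * sd2, rho * sd1 * sd2, sd2^2), 2, 2)

L <- t(chol(Sigma))                  # lower triangular, L %*% t(L) = Sigma
Z <- matrix(rnorm(2 * 10000), nrow = 2)
X <- t(mu + L %*% Z)                 # 10000 x 2 matrix of draws

colMeans(X)                          # close to mu
cov(X)                               # close to Sigma
```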

Bivariate Normal Distribution: A useful R package

{mvtnorm} R package

  • The function we built handles only a special case
  • Here we use the capabilities of {mvtnorm}; install and load it
library(mvtnorm)

Assumptions (height in cm, and weight in kg)

  • Assume: \(\mu_1 = 176, \mu_2 = 85.52, \sigma_1 = 7, \sigma_2 = 14.24, \rho = 0.47\)
mu_h <- 176
mu_w <- 85.52 
sd_h <- 7 
sd_w <- 14.24
rho <- 0.47

Bivariate Normal Distribution: Example using rmvnorm() function

Generate 10000 values

  • rmvnorm() requires the mean vector and the covariance matrix
cov <- rho * sd_h * sd_w
Sigma <- matrix(c(sd_h^2, cov, cov, sd_w^2), 2, 2, byrow = TRUE)
X <- rmvnorm(10000, c(mu_h, mu_w), Sigma)

Bivariate Normal Distribution: Marginals

Histograms and overlapped Gaussian density

  • Graphical check that the target density matches the simulated data
par(mfrow = c(1,2))
hist(X[,1], prob = TRUE, breaks = 30, cex.lab = 2)
curve(dnorm(x, mu_h, sd_h), col = "red", add=TRUE)
hist(X[,2], prob = TRUE, breaks = 30)
curve(dnorm(x, mu_w, sd_w), col = "red", add=TRUE)

Bivariate Normal Distribution: Conditionals

x1.fix <- 180
muw_c <- mu_w + rho * (sd_w/sd_h) * (x1.fix - mu_h)  
sdw_c <- sqrt(sd_w^2*(1-rho^2))
c(sd_w, sdw_c)
[1] 14.24000 12.56917
x2.fix <- 50
muh_c <- mu_h + rho * (sd_h/sd_w) * (x2.fix - mu_w)  
sdh_c <- sqrt(sd_h^2*(1-rho^2))
c(sd_h, sdh_c)
[1] 7.000000 6.178665
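A quick Monte Carlo sanity check of the conditional formulas (a sketch using base R in place of rmvnorm(): draw from the joint, keep the pairs with \(x_1\) close to 180, and compare the retained \(x_2\) values with the conditional moments; the 0.5 window width is arbitrary):

```r
# Monte Carlo check of the conditional moments (sketch)
set.seed(42)
mu_h <- 176; mu_w <- 85.52; sd_h <- 7; sd_w <- 14.24; rho <- 0.47
Sigma <- matrix(c(sd_h^2, rho * sd_h * sd_w, rho * sd_h * sd_w, sd_w^2), 2, 2)
Z <- matrix(rnorm(2 * 1e5), nrow = 2)
X <- t(c(mu_h, mu_w) + t(chol(Sigma)) %*% Z)

keep <- abs(X[, 1] - 180) < 0.5   # approximate conditioning on X1 = 180
mean(X[keep, 2])                  # close to mu_w + rho * (sd_w / sd_h) * (180 - mu_h)
sd(X[keep, 2])                    # close to sqrt(sd_w^2 * (1 - rho^2))
```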

Note

  • Note that the variance of the conditional distribution is lower than the variance of the corresponding marginal distribution

Bivariate Normal Distribution: Conditionals

par(mfrow = c(1,2))
curve(dnorm(x, muw_c, sdw_c), from = 50, to = 130, cex.lab = 2, cex.axis = 2, xlab = "x2")
curve(dnorm(x, muh_c, sdh_c), from = 140, to = 200, cex.lab = 2, cex.axis = 2, xlab = "x1")

Bivariate Normal Distribution: Bivariate Plots

Scatterplot and contour plot

  • Plot only 1000 observations and overlap the contour plot
idx <- sample(1:10000, size = 1000, replace = FALSE) 
xx <- seq(min(X[idx, 1]), max(X[idx, 1]), length.out = 500)
yy <- seq(min(X[idx, 2]), max(X[idx, 2]), length.out = 500)
zz <- outer(X = xx, Y = yy, FUN = function(x,y) 
  dmvnorm(cbind(x,y), mean = c(mu_h, mu_w), sigma = Sigma))

Bivariate Normal Distribution: Bivariate Plots

plot(X[idx, 1], X[idx, 2], xlab = "x1", ylab = "x2", cex.axis = 2, cex.lab = 2)
contour(xx, yy, zz, add = TRUE, col = "red", nlevels = 20)

Central Limit Theorem

Central Limit Theorem (CLT)

Recall the CLT

  • \(X_1,\ldots, X_n\): sequence of independent and identically distributed (iid) random variables (rv) from a distribution having
    \[E(X_i) = \mu, \quad \quad \text{var}(X_i) = \sigma^2 < \infty, \quad i=1, \ldots, n\]
  • For large n (\(\dot \sim\) indicates convergence in distribution) \[ \bar X_n = \frac{1}{n} \sum_{i=1}^{n}X_i \, \dot \sim \, \mathcal{N}(\mu, \sigma^2/n) \quad S_n = \sum_{i=1}^{n}X_i \, \dot \sim \, \mathcal{N}(n\mu, n\sigma^2)\]
  • The CLT supports the normal approximation to the distribution of a random variable (rv) that can be viewed as the sum of other rvs
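A small illustration (not part of the lab code): sample means of \(\text{Exp}(1)\) draws, which have \(\mu = \sigma^2 = 1\), already look Gaussian for moderate \(n\); the values of \(n\) and the number of replications are arbitrary choices:

```r
# CLT illustration with exponential draws
set.seed(1)
n <- 50
xbar <- replicate(5000, mean(rexp(n, rate = 1)))  # 5000 sample means

mean(xbar)   # close to mu = 1
sd(xbar)     # close to sigma / sqrt(n) = 1 / sqrt(50)

hist(xbar, prob = TRUE, breaks = 40)
curve(dnorm(x, 1, 1 / sqrt(n)), col = "red", add = TRUE)
```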

Approximation with CLT: application

The CLT is useful for computing some quantities

Example

  • Let \(X\) and \(Y\) be two independent rv, such that \(X \sim {\rm Bin}(n,p)\) and \(Y \sim {\rm Bin}(m, q)\)
  • We are interested in computing the probability \(\mathbb{P}(X>Y)\)

The Normal approximation is the simplest way to compute \(\mathbb{P}(X>Y)\)

  • \(X \, \approx \, \mathcal{N}(np, np(1-p)), \quad \quad \quad \ Y \, \approx \, \mathcal{N}(mq, mq(1-q))\)
  • \(W=X-Y\) is still normal (well known probability result), having \[\mu_W=\mu_X-\mu_Y= np-mq \quad \quad \sigma^2_W=\sigma^2_X+\sigma^2_Y= np(1-p)+mq(1-q)\]
  • \(\mathbb{P}(X>Y) = \mathbb{P}(W>0)\) can be computed easily

Approximation with CLT: Example

Waterpolo match

  • Problem: Tomorrow two professional Italian waterpolo teams, Posillipo and Pro Recco, compete against each other.

  • Goal: We are interested in computing the probability that Posillipo wins the next match against Pro Recco

Assumptions and quantity of interest

  • Let \(X\) (\(Y\)) be the random number of goals scored by Posillipo (Pro Recco), and assume that \(X, Y\) follow two independent Binomial distributions

  • \(X\) (\(Y\)) represents the number of shots converted into goals out of the total number of shots \(n\) (\(m\)) taken by Posillipo (Pro Recco), each with success probability \(p\) (\(q\))

  • Probability that Posillipo wins: \(\mathbb{P}(X>Y)=\mathbb{P}(X-Y>0)=?\)

Approximation with CLT: Example

Waterpolo match

  • We adopt a simplification, and we treat the quantities \(p,q,m,n\) as known. For instance, based on historical experience we fix \[p=0.5, \quad q=0.7, \quad n=20, \quad m=20\]
  • Let \(W\) be the r.v., such that \(W=X-Y\), we can easily compute the probability of interest (\(\mathbb{P}(X - Y > 0) = \mathbb{P}(W>0) =?\)) leveraging the CLT \[W \approx \mathcal{N}(\mu_W = \mu_X-\mu_Y, \sigma^2_W = \sigma^2_X + \sigma^2_Y)\]

What is your guess on \(\mathbb{P}(W>0)\)?

  1. Around 0.5;
  2. A small value (close to 0);
  3. A large value (close to 1)

Approximation with CLT: Example

Waterpolo match


Time to work with R to obtain a value for the probability of interest

  • In the Lab R file we will complete the missing part of the code (denoted with XXX)

Approximation with CLT: R code

p <- 0.5
q <- 0.7 
n <- m <- 20
mW <-  p * n - q * m
varW <- n * p * (1 - p) + m * q * (1 - q)
sdW <- sqrt(varW)
# P(X > Y) = P(W > 0) = ? 
PWin_P <- pnorm(0, mean = mW, 
                sd = sdW, 
                lower.tail = FALSE)
PWin_P
[1] 0.09362452

Quick comments

  • This value of the probability is based on the assumptions we made
  • According to it, there is a low probability that Posillipo wins the next match against Pro Recco

Approximation with CLT: R code

PMF of \(X\) (blue) and \(Y\) (red) and the pdf of \(W=X-Y\)

curve(dnorm(x, mW, sdW), xlim = c(-12, 20), ylim = c(0, 0.3),
      xlab = "", ylab = "", cex.axis = 2.5)
points(0 : n, dbinom(0 : n, n, p), pch = 21, bg = 1, col = "blue")
points(0 : m, dbinom(0 : m, m, q), pch = 21, bg = 2)
segments(x0 = 0, y0 = -0.01, x1 = 0, y1 = dnorm(0, mW, sdW), lwd = 2)
text(14.25, 0.22, "Y", cex = 3, col = 2)
text(10, 0.2, "X", cex = 3, col = "blue")
text(-4, 0.15, "X - Y", cex = 3, col = 1)

Approximation with CLT: A Note

Continuity correction

  • When discrete distributions are approximated by continuous distributions, it is a good practice to apply a correction (c.c.). In this case: \[\small{\mathbb{P}(X>Y) \approx \mathbb{P}(W>0) \stackrel{c.c.}{\approx} \mathbb{P}(W > 0.5)}\]
PWin_P_cc <- pnorm(0.5, mean = mW, 
                  sd = sdW, 
                  lower.tail = FALSE)
PWin_P_cc
[1] 0.06895673
PWin_P # Without c.c.
[1] 0.09362452

In general

  • Let \(X\) be a discrete r.v. that you approximate with a continuous one (\(\tilde X\)); the continuity corrections are

\[\small{\mathbb{P}(X > x) \approx \mathbb{P}(\tilde X > x) \stackrel{c.c}\approx \mathbb{P}(\tilde X > x + 0.5)}\] \[\small{\mathbb{P}(X \geq x) \approx \mathbb{P}(\tilde X \geq x) \stackrel{c.c}\approx \mathbb{P}(\tilde X > x - 0.5)}\] \[\small{\mathbb{P}(X \leq x) \approx \mathbb{P}(\tilde X \leq x) \stackrel{c.c}\approx \mathbb{P}(\tilde X < x + 0.5)}\] \[\small{\mathbb{P}(X < x) \approx \mathbb{P}(\tilde X < x) \stackrel{c.c}\approx \mathbb{P}(\tilde X < x - 0.5)}\]
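A quick numeric illustration of the first rule (the values \(n = 20\), \(p = 0.5\), \(x = 12\) are arbitrary choices):

```r
# Continuity correction at work for X ~ Bin(20, 0.5), approximated by N(np, np(1-p))
n <- 20; p <- 0.5; x <- 12
m <- n * p; s <- sqrt(n * p * (1 - p))

exact   <- pbinom(x, n, p, lower.tail = FALSE)       # P(X > 12), exact
appr    <- pnorm(x, m, s, lower.tail = FALSE)        # no correction
appr_cc <- pnorm(x + 0.5, m, s, lower.tail = FALSE)  # with correction

c(exact, appr, appr_cc)  # the corrected value is much closer to the exact one
```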

Approximation with CLT: A comment

Ideas?

  • There are several potential limits of our probabilistic model, but one deserves your attention.
  • What could be one of the limits of the previous analysis, leaving aside considerations on the assumptions we made?

Normal approximation when \(n\) is not large

  • We are leveraging the CLT to approximate the Binomial distribution with the Gaussian one
  • But when \(n\) is not large the approximation can be poor
  • The question “how large should \(n\) be for the approximation to be good?” deserves a comment: it depends on the specific conditions, and only sometimes can you follow the rule of thumb that \(n\) greater than 30/50 guarantees a good approximation

Approximation with CLT: Investigation

Ideas?

  • Let’s explore the quality of the approximation using our example
  • The Normal approximation is the simplest way to compute \(\mathbb{P}(W>0)\), but you can obtain the result differently

Obtain the exact result

  • Is the difference \(X-Y\) a well-known probability distribution?

No, you must derive its probability law

  • You can obtain the result (with a bit of effort) by using probability calculus \[\small{\mathbb{P}(W = w) = \sum_{x=\max(0,\,w)}^{\min(n,\, m+w)} \binom{n}{x} p^{x}(1-p)^{n-x} \binom{m}{x-w} q^{x-w}(1-q)^{m - (x-w)}}\]

Exact result: Naive R implementation

Obtain the probability distribution

  • There is plenty of room to improve this naive R implementation
Px <- dbinom(0 : n, n, p = p) 
Py <- dbinom(0: m, m, p = q) 
Pw <- rep(NA, 2 * n + 1)

count <- 1; count2 <- 2 * n + 1
for(i in 0 : n){
 idx1 <- 1 : (i + 1)
 idx2 <- (n + 1 - i) : (n + 1)
 if(i == n){
   Pw[count] <- sum(Px[idx1] * Py[idx2])  
 } else { 
 Pw[count] <- sum(Px[idx1] * Py[idx2]) 
 Pw[count2] <- sum(Px[idx2] * Py[idx1])
 }
 count <- count + 1
 count2 <- count2 - 1
}
sum(Pw) # check 
[1] 1

Exact result: Comparisons

Compare approximate solutions (with and without c.c.) with the exact one

  • Slight difference between the exact solution and the approximated one
sum(Pw[(n + 2) : length(Pw)]) # P(W>0 )
[1] 0.06995932
PWin_P                        # appr. P(W>0 )
[1] 0.09362452
PWin_P_cc                     # appr. with c.c. P(W>0 )
[1] 0.06895673

Exact result: Graphical Comparisons

plot(-n : m, Pw, xlab = "w", col = "red", cex.axis = 1.5, cex= 2)
curve(dnorm(x, mW, sdW), add = TRUE)

The role of \(n\) on the approximation

Let’s explore a case with lower n

  • Leaving unchanged the rest, consider \(n = m = 5\)
  • The quality of the approximation (without c.c.) deteriorates
sum(Pw[(n + 2) : length(Pw)]) # P(W>0)
[1] 0.1606744
PWin_P # P(W>0) approx.
[1] 0.2548257
PWin_P_cc # P(W>0) approx. c.c.
[1] 0.1613143

Exact result: Graphical Comparisons

plot(-n : m, Pw, xlab = "w", col = "red")
curve(dnorm(x, mW, sdW), add=TRUE)

The role of \(n\) on the approximation

Let’s explore a case with larger n

  • Leaving unchanged the rest, consider \(n = m = 100\)
  • The quality of the approximation (without c.c.) improves remarkably
sum(Pw[(n + 2) : length(Pw)]) # P(W>0 )
[1] 0.001365914
PWin_P
[1] 0.00159485
PWin_P_cc
[1] 0.001253232

Extra: A different approach?

Waterpolo match

  • Rather than specifying in advance the total number of shots and the shot-conversion probabilities, i.e. 4 parameters, one could be tempted to use a more flexible distribution accounting just for the scoring intensity, regardless of the number of shots

  • For this purpose, the Poisson distribution seems suitable. We may assume two independent Poisson distributions for the number of goals of the upcoming match: \(X\sim \text{Pois}(\lambda), \ Y \sim \text{Pois}(\mu)\)

  • At this stage, we need to specify the rates; for instance, based on our own knowledge about waterpolo scoring abilities, we fix \(\lambda=5, \mu=7\)

Extra: A different approach?

Exercise

  • Leveraging the Poisson assumption on the number of goals scored, how can we now compute the winning probability for Posillipo, \(\mathbb{P}(X>Y)=\mathbb{P}(X-Y>0)\)?
  • Obtain \(\mathbb{P}(X-Y>0)\) using the exact and approximate solution and compare them

Hint to obtain the exact solution

  • We may use the following probability result: \(Z = X-Y \sim \text{PD}(\lambda-\mu, \lambda+\mu)\), where \(\text{PD}\) stands for the Poisson difference distribution, also known as the Skellam distribution, with mean \(\lambda-\mu\) and variance \(\lambda+\mu\) (use the {skellam} R package)
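If you prefer to avoid an extra package, the exact probability can also be sketched in base R by truncating the double sum over the two Poisson PMFs (the truncation point 100 is an arbitrary but safe bound for these rates):

```r
# Exact P(X > Y) for independent Poissons via a truncated double sum (sketch)
lambda <- 5; mu <- 7
grid <- 0:100                     # truncation: mass beyond 100 is negligible here
px <- dpois(grid, lambda)
py <- dpois(grid, mu)

# sum P(X = x) * P(Y = y) over all pairs with x > y
PXgtY <- sum(outer(px, py) * outer(grid, grid, ">"))
PXgtY
```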

Law of large numbers

Law of large numbers (LLN)

Recall the LLN

Let \(X_1, \ldots, X_n\) be a sequence of independent and identically distributed r.v., such that \(E[X_i] =\mu\) and \(\text{var}(X_i) = \sigma^2 < +\infty\). Then, given the sample mean \[\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i, \] we have that, for any \(\epsilon>0\),

\[\lim_{n \to \infty} \mathbb{P}(|\bar X_n - \mu| \geq \epsilon)=0. \]

Idea in practice

  • Suppose we toss a coin \(n\) times and let \(k\) be the number of heads obtained. Then \(k/n\) is the proportion of heads in \(n\) tosses. If the coin is fair, we can guess that this proportion is not too far from 0.5 (it is unlikely to be exactly 0.5). Although, by chance, we may observe anomalies (unusually large or small numbers of heads), such strange behavior disappears as the number of tosses increases.

Law of large numbers (LLN): Example

LLN_ex <- function(n, p = 0.5, seed = 1234){
  par(mfrow = c(1, 2), mar = c(4, 4, 1, 1))
  set.seed(seed)
  #Each trajectory is a realization of a sequence of rv
  # First trajectory
  x <- sample(c(0, 1), n, prob = c(1 - p, p), replace = TRUE)
  plot(1 : n, cumsum(x)/(1 : n), type = "l", ylim = c(0, 1), 
       ylab = expression(n[Head]/n), xlab = "n") 
  abline(h = .5, lty = 2, col = "red")
  
  # Second trajectory
  x <- sample(c(0, 1), n, prob = c(1 - p, p), replace = TRUE)
  points(1 : n, cumsum(x)/(1 : n), type = "l", col = "grey")
  
  plot(1 : n, cumsum(x)/(1 : n), type = "l", ylim = c(0, 1), 
       ylab = expression(n[Head]/n), xlab="n", col = "gray80")
  
  # Instead, considering 100 trajectories
  for(i in 1 : 100){
    x <- sample(c(0, 1), n, prob = c(1 - p, p), replace = TRUE)
    points(1 : n, cumsum(x)/(1 : n), type = "l", col = "gray80", lwd = .5)
  }
  abline(h = .5, lty = 2, col = "red")
}

Law of large numbers (LLN): Example \((n = 500)\)

n <- 500
LLN_ex(n)

Law of large numbers (LLN): Example \((n = 5000)\)

n <- 5000
LLN_ex(n)